Project: Investigate Gapminder World Database (A Glance to Countries GDP)

Introduction

This dataset was downloaded from Gapminder World, which has collected a large amount of information about how people live in different countries, tracked over the years across a number of different indicators.

For this study I chose four indicators from the Gapminder datasets: one dependent indicator, total GDP in US dollars, and three independent indicators: employment rate, average years of schooling (OWID), and total electricity generation in kilowatt-hours.

I chose GDP as the key dependent indicator because of its importance in measuring a country's economy. It gives essential information on the size of an economy and how it is performing. The general health of an economy is often indicated by the real GDP growth rate. For instance, an increase in real GDP is interpreted as a sign that the economy is performing well (source: IMF).

The three independent indicators are as follows: employment rate, average years of schooling (OWID), and total electricity generation.

This report will explore and try to answer the following questions based on the previously mentioned indicators:

  1. Which country had the highest increase in total GDP in a given period? Which country had the largest decline?
  2. What is the trend of total GDP growth over the years in these two countries?
  3. Which five countries achieved the highest total GDP growth, and which five countries had the lowest, in a given period?
  4. Which country has the highest share of total world GDP (considering only the countries included in the dataset after cleaning)?
  5. Which country is responsible for the highest share of electrical energy generation in the world (considering only the countries included in the dataset after cleaning)?
  6. How does the average employment rate trend over the years across countries, and which countries had the highest average employment rate?
  7. Which of the mentioned independent indicators affected GDP, and how?

Data Wrangling

In this section of the report, I will load the data, check its cleanliness, then trim, clean, and merge the datasets for analysis. Further explanation of the step-by-step data wrangling process is included alongside the code.

General Properties

First, we load the datasets, review the properties of each, and analyze them for possible cleaning and modification.
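A minimal sketch of loading one dataset and reviewing its properties with pandas. The CSV content below is a made-up stand-in (hypothetical countries and values), and reading from an in-memory string is an assumption so that the example runs without the actual Gapminder files; with real data, the path of the downloaded file would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Toy stand-in for a Gapminder CSV export: one row per country,
# one column per year. All names and values are invented.
toy_csv = io.StringIO(
    "country,1990,1991,1992\n"
    "Aland,10.0,11.0,12.5\n"
    "Borduria,,4.0,4.2\n"
)

gdp = pd.read_csv(toy_csv)

# Review the general properties of the dataset.
print(gdp.shape)           # (number of rows, number of columns)
print(gdp.dtypes)          # data type of each column
print(gdp.isnull().sum())  # count of null values per column
```

The same three checks can be repeated for each of the four datasets to compare their shapes and spot missing values.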

Note that the end year in the OWID dataset is 2017; it also has data for years well before the other datasets begin.

Note that the start year in the employment rate dataset is 1991, and that this dataset has only 33 columns.

Data Cleaning

Data cleaning will be performed step by step to bring the data into the best possible form for analysis and for answering the underlying questions.

Matching the Datasets columns shape

Before checking for null values, and for the sake of comparison and clear analysis across the datasets, I will drop columns so that the datasets match in terms of years. For instance, we can see from the dataset shapes that the employment rate dataset has the fewest columns. We also notice that the OWID dataset ends in 2017, which is earlier than the end year of the other datasets. Hence, as a first step I will drop the year columns before 1991 so that every dataset starts at the same year as the dataset with the fewest columns. In a second step, I will drop the year columns after 2017.

Step one: dropping columns before 1991

Step two: dropping columns after 2017
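The two steps can be sketched with one small helper. This is an illustration, not the notebook's original code: the helper name `keep_years`, the `country` column name, and the toy values are assumptions, and year columns are assumed to be strings such as `"1991"`, as they are when read from a CSV.

```python
import pandas as pd

def keep_years(df, first_year, last_year, id_col="country"):
    """Return a copy of df with id_col plus the year columns
    that fall within [first_year, last_year]."""
    year_cols = [
        c for c in df.columns
        if c != id_col and first_year <= int(c) <= last_year
    ]
    return df[[id_col] + year_cols].copy()

# Toy dataset with invented values.
toy = pd.DataFrame({
    "country": ["Aland", "Borduria"],
    "1990": [1.0, 2.0],
    "1991": [1.1, 2.1],
    "2017": [1.5, 2.5],
    "2018": [1.6, 2.6],
})

# Drops 1990 (before 1991) and 2018 (after 2017) in one pass.
trimmed = keep_years(toy, 1991, 2017)
print(trimmed.columns.tolist())
```

Applying the same call to every dataset leaves them all with the same set of year columns.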

Dropping null-value rows

Now that all datasets cover the same years, the second step is to drop rows that contain null values. I chose to do this after dropping the columns so that we do not lose too many countries, since many countries had null values in the early years of most of the datasets. After this step we will look into the best way to join the data together for the purpose of exploring the underlying questions.

Notice that the employment rate and electricity datasets have no null values.
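A short sketch of dropping rows with null values, assuming a wide country-by-year layout. The data is invented for illustration.

```python
import pandas as pd

# Toy dataset: Borduria is missing its 1991 value.
toy = pd.DataFrame({
    "country": ["Aland", "Borduria", "Carpania"],
    "1991": [1.0, float("nan"), 3.0],
    "1992": [1.2, 2.2, 3.2],
})

# Count nulls before cleaning, then drop every row containing a null.
nulls_before = int(toy.isnull().sum().sum())
clean = toy.dropna().reset_index(drop=True)

print(nulls_before)
print(clean["country"].tolist())
```

Because the year columns were trimmed first, a country is dropped only if it has a gap inside the retained period, not because of gaps in earlier years.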

Trimming and adjusting individual datasets

Although merging the dataframes is essential for exploring some of this study's questions, I chose to do it at a later stage, because other questions require working on the individual datasets. This is also why I chose to drop the null values from each dataset individually instead of dropping them after merging.

GDP dataset toning and trimming 1

In this section I will trim and tone the GDP dataset to create a new dataset that will help us explore the first two questions of this study. To do this I will implement the following steps:

  1. Add a new column to the GDP dataset holding the total growth over the dataset's time period, computed by subtracting the first-year value from the last-year value for each country.
  2. Find both the maximum and minimum GDP growth values in the dataset.
  3. Create a new dataset including only the two countries with the maximum and minimum GDP growth values.
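The three steps above can be sketched as follows. The column name `total_growth`, the country names, and all values are hypothetical; only the technique (last year minus first year, then `idxmax`/`idxmin`) is the point.

```python
import pandas as pd

# Toy GDP dataset with invented values.
gdp = pd.DataFrame({
    "country": ["Aland", "Borduria", "Carpania"],
    "1991": [100.0, 50.0, 80.0],
    "2017": [400.0, 45.0, 120.0],
})

# Step 1: total growth = last-year value minus first-year value.
year_cols = [c for c in gdp.columns if c != "country"]
gdp["total_growth"] = gdp[year_cols[-1]] - gdp[year_cols[0]]

# Step 2: countries with the maximum and minimum growth.
top = gdp.loc[gdp["total_growth"].idxmax(), "country"]
bottom = gdp.loc[gdp["total_growth"].idxmin(), "country"]

# Step 3: new dataset holding only those two countries.
extremes = gdp[gdp["country"].isin([top, bottom])].reset_index(drop=True)
print(extremes)
```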

GDP dataset toning and trimming 2

Here, I will trim the chi_ukr_gdp dataset to create a new dataset, then manipulate this new dataset to help us explore the third question in this study. I will need to convert the rows of the dataset into columns to be able to use them as the basis for my analysis. To do this I will apply the following:

  1. Drop the previously created GDP total growth column.
  2. Transpose the dataset so that rows become columns and columns become rows.
  3. Finally, add a column for the years.
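A sketch of this reshaping under assumed names: the `total_growth` column, the country names, and the values are placeholders standing in for the two-country dataset.

```python
import pandas as pd

# Toy two-country dataset, including a previously added growth column.
two = pd.DataFrame({
    "country": ["Aland", "Borduria"],
    "1991": [100.0, 50.0],
    "1992": [150.0, 48.0],
    "total_growth": [50.0, -2.0],
})

# Step 1: drop the growth column.
# Step 2: transpose so each country becomes a column and each year a row.
by_year = two.drop(columns="total_growth").set_index("country").T
by_year.columns.name = None

# Step 3: add the years as a regular integer column.
by_year["year"] = by_year.index.astype(int)
by_year = by_year.reset_index(drop=True)
print(by_year)
```

With one row per year, each country column can be plotted directly against `year`.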

Education (OWID) dataset toning and trimming

In this section I will trim and tone the OWID dataset to create a new dataset that will help us explore more questions of this study. To do this I will implement the following steps:

  1. Add a new column to the OWID dataset holding the increase in average years of schooling over the dataset's time period, computed by subtracting the first-year value from the last-year value for each country.
  2. Create a new dataset including only the two countries previously found in the GDP dataset wrangling (Ukraine and China).
  3. Drop the added column from the OWID dataset, as it is no longer needed and may cause conflicts in further analysis using the same dataset.
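One possible way to get the same result without a separate drop step is to compute the change on a copy, so the original dataset is never modified. This is a variation on the steps above, not the notebook's own code; the helper name `subset_with_change` and the toy data are assumptions.

```python
import pandas as pd

def subset_with_change(df, countries, change_col="change", id_col="country"):
    """Return rows for `countries` with an added last-minus-first-year
    column. Works on a copy, so `df` itself is left untouched."""
    year_cols = [c for c in df.columns if c != id_col]
    with_change = df.copy()
    with_change[change_col] = with_change[year_cols[-1]] - with_change[year_cols[0]]
    mask = with_change[id_col].isin(countries)
    return with_change[mask].reset_index(drop=True)

# Toy schooling dataset with invented values.
owid = pd.DataFrame({
    "country": ["Aland", "Borduria", "Carpania"],
    "1991": [5.0, 8.0, 6.0],
    "2017": [7.5, 9.0, 6.5],
})

subset = subset_with_change(owid, ["Aland", "Borduria"], change_col="schooling_increase")
print(subset)
```

The same helper could be reused for the employment rate dataset, which follows the same three steps.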

Electricity dataset toning and trimming

In this section I will trim and tone the electricity generation dataset to create a new dataset that will help us explore more questions of this study. To do this I will implement the following steps:

  1. Create a new dataset including only the two countries previously found in the GDP dataset wrangling (Ukraine and China).
  2. Transpose the dataset so that rows become columns and columns become rows.
  3. Finally, add a column for the years.

Employment rate dataset toning and trimming

In this section I will trim and tone the employment rate dataset to create a new dataset that will help us explore more questions of this study. To do this I will implement the following steps:

  1. Add a new column to the employment rate dataset holding the increase in employment rate over the dataset's time period, computed by subtracting the first-year value from the last-year value for each country.
  2. Create a new dataset including only the two countries previously found in the GDP dataset wrangling (Ukraine and China).
  3. Drop the added column from the emp dataset, as it is no longer needed and may cause conflicts in further analysis using the same dataset.

Merging Datasets

As a final step of the data wrangling, I will merge all four datasets included in this study into one big dataset. The merge will use the inner method so that only the data shared by all the datasets is included. This will be done in the following steps:

  1. Turning rows into columns for all the datasets.
  2. Merging all the datasets together using an inner merge.
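The exact reshaping used before the merge is not shown here, so the following is only one plausible sketch: each wide dataset is melted into long form (one row per country and year) and the results are joined with an inner merge on both keys. The helper `to_long`, the value column names, and the data are assumptions, and only two of the four datasets are shown for brevity.

```python
import pandas as pd

def to_long(df, value_name, id_col="country"):
    """Reshape a wide country-by-year table into long form:
    one row per (country, year) with a single value column."""
    long_df = df.melt(id_vars=[id_col], var_name="year", value_name=value_name)
    long_df["year"] = long_df["year"].astype(int)
    return long_df

# Toy datasets with invented values; only Aland appears in both.
gdp = pd.DataFrame({
    "country": ["Aland", "Borduria"],
    "1991": [100.0, 50.0],
    "1992": [110.0, 49.0],
})
emp = pd.DataFrame({
    "country": ["Aland", "Carpania"],
    "1991": [60.0, 55.0],
    "1992": [61.0, 56.0],
})

# The inner merge keeps only country-year pairs present in both datasets.
merged = to_long(gdp, "gdp").merge(
    to_long(emp, "employment_rate"), on=["country", "year"], how="inner"
)
print(merged)
```

Further datasets can be chained with additional `.merge(...)` calls on the same two keys.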

Exploratory Data Analysis

Now that the data wrangling and cleaning stage is over, we start computing statistics and exploring this report's underlying questions.

Research Question 1: Which country had the highest increase in the total GDP in a given period? Which country had the highest decline?

Within the final GDP dataset used in this study, China is the country with the highest growth in total GDP, with a total increase of 9,295,000,000,000 USD, while Ukraine had the largest decline in GDP during the period from 1991 to 2017.

Research Question 2: How is the trend of GDP total growth over the years in these two countries?

We can see from the graph that while Ukraine's GDP declined steadily over the years, China's GDP grew rapidly.

Research Question 3: Which five countries achieved the highest total GDP growth, and which five countries had the lowest, in a given period?

The five countries with the highest GDP growth between 1991 and 2017 are:

  1. China
  2. USA
  3. India
  4. Japan
  5. UK

The five countries with the lowest GDP growth between 1991 and 2017 are:

  1. Ukraine
  2. Tuvalu
  3. Marshall Islands
  4. Micronesia, Fed. Sts
  5. Kiribati
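A ranking like the two lists above can be obtained with `nlargest` and `nsmallest`. The sketch below uses invented countries and growth values, and ranks the top and bottom two instead of five to keep the toy data small.

```python
import pandas as pd

# Toy growth table with invented values.
growth = pd.DataFrame({
    "country": ["Aland", "Borduria", "Carpania", "Drusselstein"],
    "total_growth": [300.0, -5.0, 40.0, 2.0],
})

n = 2  # the report uses 5
highest = growth.nlargest(n, "total_growth")["country"].tolist()
lowest = growth.nsmallest(n, "total_growth")["country"].tolist()

print(highest)
print(lowest)
```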

Research Question 4: Which country has the highest share of total world GDP (considering only the countries included in the dataset after cleaning)?

The pie chart shows that the USA has the highest share of total GDP, summed over the dataset years, with a share of 25.3%.
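The share behind such a pie chart can be computed by summing each country's values over all years and dividing by the grand total. This is a sketch with invented values; the same calculation would apply to the electricity generation dataset.

```python
import pandas as pd

# Toy GDP dataset with invented values.
gdp = pd.DataFrame({
    "country": ["Aland", "Borduria", "Carpania"],
    "1991": [30.0, 10.0, 10.0],
    "1992": [30.0, 10.0, 10.0],
})

# Sum over all years per country, then express as a percentage of the total.
totals = gdp.set_index("country").sum(axis=1)
share_pct = (totals / totals.sum() * 100).round(1)

largest = share_pct.idxmax()
print(share_pct)
print(largest)
```

Note that the percentages describe shares of the countries remaining after cleaning, not of the whole world.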

Research Question 5: Which country is responsible for the highest share of electrical energy generation in the world (considering only the countries included in the dataset after cleaning)?

The pie chart shows that the USA is responsible for the highest share of total electricity generation, summed over the dataset years, with a share of 24.3%.

Research Question 6: How does the average employment rate trend over the years across countries, and which countries had the highest average employment rate?

We can see from the graph that the average employment rate differs from one country to another. The countries with the highest average employment rate are:

  1. United Arab Emirates with a rate of 75.7%
  2. Vietnam with a rate of 75.6%
  3. Iceland with a rate of 72.9%
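A sketch of how such averages can be derived: the mean across year columns gives each country's average rate, and the mean across countries gives the yearly trend. Countries and values are invented.

```python
import pandas as pd

# Toy employment rate dataset (percent), invented values.
emp = pd.DataFrame({
    "country": ["Aland", "Borduria", "Carpania"],
    "1991": [60.0, 70.0, 50.0],
    "1992": [62.0, 74.0, 52.0],
})
rates = emp.set_index("country")

# Average rate per country over all years, highest first.
avg_rate = rates.mean(axis=1).sort_values(ascending=False)
top = avg_rate.head(2)  # the report lists the top 3

# Average rate across countries for each year (the overall trend).
yearly_mean = rates.mean(axis=0)

print(top)
print(yearly_mean)
```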

Research Question 7: Which of the mentioned independent indicators affected GDP, and how?

The graph shows, to a certain extent, a weak positive relation; however, it does not establish the existence of a relationship between GDP and average years of schooling (OWID).

The graph shows no clear correlation between GDP and the employment rate.

According to the graph, there could be a strong positive relation between electrical energy infrastructure and GDP; however, the graph alone cannot confirm it.
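The observations above are based on graphs. As a complementary check, a Pearson correlation coefficient could be computed on the merged dataset. This is a suggested addition, not part of the original analysis, and the column names and values below are invented.

```python
import pandas as pd

# Toy merged dataset with invented values.
merged = pd.DataFrame({
    "gdp": [1.0, 2.0, 3.0, 4.0],
    "electricity": [2.0, 4.0, 6.0, 8.5],
    "employment_rate": [60.0, 58.0, 61.0, 59.0],
})

# Pearson correlation of GDP with each independent indicator.
r_electricity = merged["gdp"].corr(merged["electricity"])
r_employment = merged["gdp"].corr(merged["employment_rate"])

print(round(r_electricity, 3))
print(round(r_employment, 3))
```

A coefficient near 1 or -1 indicates a strong linear association and one near 0 indicates none, but in neither case does it show that one indicator causes the other.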

Conclusions

As a result of my data wrangling, cleaning, analysis, and exploration, I can answer the research questions as follows:

  1. The dataset statistics show that China is the country with the highest growth in total GDP, with a total increase of 9,295,000,000,000 USD, while Ukraine had the largest decline in GDP during the period from 1991 to 2017.

  2. Ukraine's GDP declined steadily over the years, while China's GDP grew rapidly.

  3. The five countries with the highest GDP growth between 1991 and 2017 are China, USA, India, Japan, and UK. The five countries with the lowest GDP growth between 1991 and 2017 are Ukraine; Tuvalu; Marshall Islands; Micronesia, Fed. Sts; and Kiribati.

  4. The USA has the highest share of total GDP, summed over the dataset years, with a share of 25.3%.

  5. The USA accounts for the highest share of total electricity generation, summed over the dataset years, with a share of 24.3%.

  6. The average employment rate differs from one country to another, with some countries skewing higher than others. The countries with the highest average employment rate are the United Arab Emirates (75.7%), Vietnam (75.6%), and Iceland (72.9%).

  7. When observing the relation between total GDP and the independent indicators, we find that total GDP and electricity generation seem to be directly proportional. However, we cannot conclude this without further study and analysis of other possible reasons for the apparent relation.

Limitations

This study has some limitations, including the following:

  1. The study is limited to the countries and years included in the final dataset. It covers only the period from 1991 to 2017 and is missing countries such as Afghanistan.

  2. Dropping missing or null values from the datasets might skew the analysis and could introduce unintentional bias into the relationships being analyzed.

  3. The study is limited to the given indicators and does not include other important indicators such as the consumer price index.

Implications

The implications for future research on this topic are as follows:

  1. Including other indicators such as the consumer price index, investment, and export volume can help us understand more about the factors that affect GDP.

  2. I also recommend including other education indicators, such as education costs or the number of educational institutions and their ratings, to give a clearer representation of education in a given country and whether it relates to GDP.

References

A list of the reference links used in this study is attached in the zip file of this study as a readme text file.